Phase 1: EDA & Initial Baseline Model

Abstract

Many consumers struggle to receive loan support from banks because they lack a credit history. Home Credit is a service whose goal is to provide loan opportunities for this underserved population. Failing to build and implement an accurate repayment detection method carries major consequences: if a loan is granted to consumers likely to default, Home Credit loses the expected interest and may not recoup the principal. This paper addresses this issue by proposing a machine learning approach that uses Home Credit internal and external loan application and credit payment history data for automatic loan default detection. We introduce a simple and explainable logistic regression algorithm trained on the loan application data. Additionally, we explore more advanced machine learning and deep learning algorithms, such as gradient boosting machines and neural networks, to improve default classification. The results show strong performance comparable to existing algorithms scoring near the top of the open-source Kaggle leaderboard.

Project Description

The main project objective is to build a machine learning classification algorithm with current and existing loan application and payment history data to determine whether a new Home Credit loan applicant will default on a loan. This project abides by the Cross Industry Standard Process for Data Mining (CRISP-DM) framework to build a valuable default detection system. The CRISP-DM framework consists of six iterative phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. Since this modeling exercise is for research purposes only, the deployment step is omitted. The first phase of the project is to complete one iteration with the available data to model and evaluate a regularized logistic regression model for default detection. The following file contains the exploratory data analysis and initial baseline model for the project "Identifying Home Credit Application Default Risk". Modeling has been performed using data from the Home Credit Default Risk (HCDR) Kaggle competition.

Background on Home Credit Group & the Data

Some of the challenges:

  1. Dataset size
    • 688 MB compressed (2.71 GB uncompressed)
    • millions of rows of data

Background: Home Credit Group

Many people struggle to get loans due to insufficient or non-existent credit histories. Unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit Group

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Background on the dataset

Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.

The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who either cannot obtain loans or become victims of untrustworthy lenders.

Home Credit Group has over 29 million customers, total assets of EUR 21 billion, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).

Data files overview

There are 7 different sources of data:

(Figure: overview of the seven data files and their relationships.)

Application train & test datasets

The application dataset has the most information about the client: Gender, income, family status, education ...

Other datasets

Note: The other data sets will be incorporated in phase 2 modeling.

Environment configuration

Libraries

Global options

Directories

Functions

EDA

Data is loaded into a dictionary and, during loading, initial statistics and information about each dataset are made available.
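A minimal sketch of this loading pattern, assuming the Kaggle CSVs sit in a local `data/` directory (the path and file names are illustrative):

```python
import pandas as pd

def load_datasets(files, data_dir="data"):
    """Load each CSV into a dict keyed by dataset name and print quick stats."""
    datasets = {}
    for name in files:
        df = pd.read_csv(f"{data_dir}/{name}.csv")
        datasets[name] = df
        print(f"{name}: {df.shape[0]:,} rows x {df.shape[1]} columns, "
              f"{df.isna().any().sum()} columns with missing values")
    return datasets

# e.g. datasets = load_datasets(["application_train", "bureau", "previous_application"])
```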

Data Load & Summary Statistics

Contents to DF and Cleanup

Exploratory Data Analysis will now be performed on the datasets. This will provide insights into the data, which will then be helpful for feature engineering and new feature creation.

Application Datasets - Exploration

Unique Values in Columns with Object Type

There are no anomalies: no unique values are present in the test set that are absent from the train set. This information is also useful in determining the encoding needed for categorical values.

Missing Data

There are no columns with more than 70% of the data missing. 17 columns have missing data for more than 60% of the records.
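This check can be reproduced with a short helper; `app_train` in the usage comment stands for the application train DataFrame:

```python
import pandas as pd

def missing_report(df):
    """Return per-column missing percentage, sorted descending."""
    pct = df.isna().mean().mul(100).sort_values(ascending=False)
    return pct[pct > 0]

# report = missing_report(app_train)
# print((report > 60).sum(), "columns are more than 60% missing")
```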

Value Distribution in TARGET Column

A value of 0 in the TARGET column indicates that the loan has been repaid, whereas a value of 1 indicates that the loan has not been repaid. From the above chart we can see that there is an imbalance in the distribution of class (TARGET) values: the percentage of non-repaid records in the train dataset is much smaller than the percentage of repaid loans.
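A quick way to quantify the imbalance (on the HCDR train set this shows roughly 92% repaid vs. 8% not repaid):

```python
import pandas as pd

def target_distribution(df, target="TARGET"):
    """Percentage of each class in the target column."""
    return df[target].value_counts(normalize=True).mul(100).round(2)
```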

In the following sections, further analysis is done on some of the interesting fields present in the dataset and their relationship with the TARGET field.

Age of the Applicants

Credit Amount of the Applicants

Normalized Score from External Source 1

Normalized Score from External Source 2

Normalized Score from External Source 3

Region Rating Client

Region Rating Client with City

The above charts suggest that "Region Rating Client" and "Region Rating Client with City" contain more or less the same information.

Information About Building where Client Lives

Occupation of the Applicants

Income Source for the Applicants

Education of the Applicants

Housing Type of the Applicants

Family Status of the Applicants

Members Accompanying Clients

Organization Type Applying for Loan

Previous Application Dataset - Exploration

Unique Values in Columns with Object Type

Missing Data

2 columns have missing data for over 99% of the records. These columns can be ignored in the model.

Contract Types in Previous Application

Amount Requested in Previous Application & Amount Approved For

Purpose of Cash Loan in Previous Application

Contract Status in Previous Application

Reason for Previous Loan Rejection

Credit Card Balance Dataset - Exploration

Missing Data

Contract Status on the Previous Credit

Amount Balance on Previous Credit

Days Past Due (DPD) During Month on Previous Credit

Credit Card Limit on Previous Credit

Installment Payment Dataset - Exploration

Missing Data

Prescribed Installment Amount of Previous Credit on this Installment

POS Cash Balance Dataset - Exploration

Missing Data

Installments Left to Pay on Previous Credit

Contract Status During the Month

Bureau Dataset - Exploration

Unique Values in Columns with Object Type

Missing Data

Status of the Credit in Bureaus

Credit Type in Bureaus

Credit Amount Overdue

Current Credit Amount in Bureaus

Bureau Balance Dataset - Exploration

Missing Data

Bureau Balance Status

C means closed, X means status unknown, 0 means no DPD (days past due), 1 means the maximal DPD during the month was between 1 and 30, 2 means DPD 31-60, ..., and 5 means DPD 120+ or the credit was sold or written off.

Number of Months Balance - Relative to App Date

Further Analysis on Application Data

Correlation analysis and pair-plot based visualizations are performed.
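The correlation ranking against TARGET can be sketched as follows (restricted to numeric columns; `app_train` is the application train DataFrame):

```python
import pandas as pd

def top_correlations(df, target="TARGET", n=20):
    """Top-n positive and top-n negative Pearson correlations against the target."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    return corr.nlargest(n), corr.nsmallest(n)

# pos, neg = top_correlations(app_train)
```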

Correlation - Top 20 features against TARGET - Positive and Negative

Correlation - Heat Map - Top features against TARGET

There are a handful of explanatory features which show high collinearity; in phase 2 (once we increase our feature space) we will address removing features which do not add additional value.

Pair Plot

Collinearity Issues

Modeling

Modeling Methods

The initial baseline model will be fit only with the application data set to predict credit default.

Data:

Validation strategy:

Features:

Data preprocessing:

Tuning strategy:

Algorithms:

Experiments:

Evaluation:

Split data

Generate a smaller sample for tuning

A standard machine with 4 cores and 16 GB of RAM does not have enough compute power to train a model of this size. Thus we make an additional split of the training data in order to perform hyperparameter tuning.

The hyperparameter tuning will be completed using a random 10% of the training data. The hyperparameters will be chosen based on the subset generating the highest 5-fold validation area under the ROC curve.

We will use the remaining 90% of the training data to perform 5-fold cross validation with the best hyperparameters from the tuning process to gather estimates of how the model will perform on unseen data.
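The 10%/90% tuning split can be sketched with scikit-learn; stratifying on the target is an assumption here (it preserves the class imbalance in both pieces):

```python
from sklearn.model_selection import train_test_split

def make_tuning_split(X, y, tune_frac=0.10, seed=42):
    """Carve a stratified 10% tuning subset out of the training data;
    the remaining 90% is kept for 5-fold CV with the chosen hyperparameters."""
    X_tune, X_cv, y_tune, y_cv = train_test_split(
        X, y, train_size=tune_frac, stratify=y, random_state=seed
    )
    return X_tune, X_cv, y_tune, y_cv
```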

Pipelines

For the initial model building phase, only the features from the application data set will be inputs into the model.

The data pipeline for these input features consists of four steps for the initial model building phase. There are two transformations applied to numeric features and two transformations applied to categorical features. The steps of the data preparation pipeline are listed below:
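A sketch of such a four-step pipeline in scikit-learn; the specific choices (median imputation plus scaling for numerics, most-frequent imputation plus one-hot encoding for categoricals) are assumptions, since the notebook's actual steps were in a code cell:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_preprocessor(numeric_cols, categorical_cols):
    """Four-step preparation pipeline: impute + scale numerics,
    impute + one-hot encode categoricals."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])
```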

Pipeline architecture

Features

There are 13 different feature types within the data set ranging from loan information to client demographics. The different types of features are shown below along with the number of features in each category from the raw data.

Set the data pipeline

Hyperparameter Tuning

Candidate parameters

The logistic regression model is tuned to determine the optimal mixture of L1 and L2 regularization and the amount of regularization.

The random forest algorithm is tuned to determine the optimal number of trees, maximum number of features, maximum tree depth, minimum split size, and minimum samples in a leaf.

Lastly the XGBoost model is tuned to determine the number of trees, learning rate, maximum tree depth, and columns sampled by each tree.
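The three search spaces described above might look like the following; the candidate values are illustrative, not the notebook's actual grids, and the `clf__` prefix assumes each estimator sits in a pipeline step named `clf`:

```python
# Illustrative hyperparameter grids for the three algorithms tuned in phase 1.
param_grids = {
    "logistic_regression": {
        "clf__C": [0.01, 0.1, 1.0],        # amount of regularization (inverse strength)
        "clf__l1_ratio": [0.0, 0.5, 1.0],  # L1/L2 mixture for elastic net
    },
    "random_forest": {
        "clf__n_estimators": [100, 200, 500],  # number of trees
        "clf__max_features": ["sqrt", 0.3],    # maximum features per split
        "clf__max_depth": [5, 10, None],       # maximum tree depth
        "clf__min_samples_split": [2, 10],     # minimum split size
        "clf__min_samples_leaf": [1, 5],       # minimum samples in a leaf
    },
    "xgboost": {
        "clf__n_estimators": [100, 200],           # number of trees
        "clf__learning_rate": [0.05, 0.1, 0.2],    # shrinkage per tree
        "clf__max_depth": [3, 5],                  # maximum tree depth
        "clf__colsample_bytree": [0.2, 0.5, 1.0],  # columns sampled by each tree
    },
}
```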

Regularized Logistic Regression w/ all columns

Regularized logistic regression (LR) is a generalized linear model with a sigmoid activation function applied to generate probabilities that input data is of a certain class. Logistic regression is a parameterized model optimizing binary cross-entropy loss when learning model coefficients.

Due to a large feature space, we will apply elastic net regularization which combines the L1 and L2 penalties to reduce model complexity and perform built in feature selection.
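In scikit-learn, elastic net logistic regression requires the `saga` solver; a sketch with illustrative, untuned settings:

```python
from sklearn.linear_model import LogisticRegression

# l1_ratio mixes the penalties: 0 = pure L2 (ridge), 1 = pure L1 (lasso).
# Values in between keep coefficients small while zeroing out weak features.
logreg = LogisticRegression(
    penalty="elasticnet",
    solver="saga",    # the only sklearn solver supporting the elastic net penalty
    l1_ratio=0.5,     # illustrative mixture; tuned in practice
    C=1.0,            # inverse regularization strength; tuned in practice
    max_iter=1000,
)
```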

Logistic Regression Model Architecture

Learning a logistic regression model

The objective function for learning a binary logistic regression model (log loss) is known as the binary cross-entropy (CXE) loss function.

This can be expressed directly in terms of perpendicular distances from X to the hyperplane theta as follows:
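A standard form of this objective, writing $\hat{y}_i = \sigma(\theta^\top x_i)$ for the predicted probability of class 1:

```latex
J(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \log \hat{y}_i + (1 - y_i)\log\big(1 - \hat{y}_i\big) \Big],
\qquad
\hat{y}_i = \sigma(\theta^\top x_i) = \frac{1}{1 + e^{-\theta^\top x_i}}
```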

Tuning

Cross Validation Evaluation

Regularized Logistic Regression w/ loan features

For the second set of models, additional loan specific features are added including:

The results of the baseline model will be compared to the results of the model with the additional features added. Whichever feature set performs better will then be fit using more advanced tree based algorithms - random forest and xgboost.

We will assess feature importance after model fitting to identify the top features to select for modeling in phase 2 as additional data sources will be added to the algorithms tested in phase 1.

The baseline model results will be compared to the results from the three additional models using area under the roc curve, classification accuracy, precision, and recall.

Tuning

Cross Validation Evaluation

We have slightly better performance with the added features. We will use these features moving forward with the more advanced tree-based algorithms.

Random Forest w/ loan features

Random forests (RF) are a bagging ensemble modeling technique which trains a collection of decision trees and averages the prediction made by each tree. For each tree only a random subsample of the available features is selected for building the tree.

Random Forest Model Architecture

Learning a random forest classification model

Decision tree learning is a supervised machine learning approach whose goal is to recursively partition the input feature space into different zones to predict the target value. Decision trees use information gain to determine the next optimal split within the tree. Information gain is the expected reduction in entropy from splitting on a particular attribute.
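Entropy and information gain can be computed directly from their definitions; a small sketch for a binary split:

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(y, mask):
    """Expected reduction in entropy from splitting y by a boolean mask."""
    n = len(y)
    left, right = y[mask], y[~mask]
    child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(y) - child
```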

Tuning

Cross Validation Evaluation

XGBoost w/ loan features

Extreme gradient boosting is a gradient boosting machine learning model optimized for faster convergence. A gradient boosting machine is a tree-based model which sequentially learns new trees to fit residuals from previous iterations. Thus, observations poorly predicted are sequentially weighted higher for future fits. This modeling technique consistently scores near the top of machine learning competitions.

Gradient Boosting Model Architecture

Learning a gradient boosting classification model

Decision tree learning is a supervised machine learning approach whose goal is to recursively partition the input feature space into different zones to predict the target value. Decision trees use information gain to determine the next optimal split within the tree. Information gain is the expected reduction in entropy from splitting on a particular attribute.

Gradient boosting then sequentially learns decision trees to model residuals from previous decision trees.

Tuning

Cross Validation Evaluation

Model Evaluation

The main metric used to evaluate model performance is the area under the ROC curve. The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity). Classifiers with an area under the curve close to 1 are excellent performers; as the area approaches 0.5, the model approaches random guessing.

The other metrics we evaluate are precision, recall, F1 score, classification accuracy, and the confusion matrix of actual versus predicted classes.
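These metrics can be collected in one helper using the standard scikit-learn API:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

def evaluate(y_true, y_pred, y_prob):
    """Collect the evaluation metrics used in this phase in a single dict."""
    return {
        "auc": roc_auc_score(y_true, y_prob),  # ranks by predicted probability
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }
```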

Regularized Logistic Regression w/ all columns

Feature Importance

Regularized Logistic Regression w/ loan features

Feature Importance

Random Forest w/ loan features

Feature Importance

XGBoost w/ loan features

Feature Importance

Kaggle Submission

Kaggle Results

Write up

Abstract

Many consumers struggle to receive loan support from banks because they lack a credit history. Home Credit is a service whose goal is to provide loan opportunities for this underserved population. Failing to build and implement an accurate repayment detection method carries major consequences: if a loan is granted to consumers likely to default, Home Credit loses the expected interest and may not recoup the principal. This paper addresses this issue by proposing a machine learning approach that uses Home Credit internal and external loan application and credit payment history data for automatic loan default detection. We introduce a simple and explainable logistic regression algorithm trained on the loan application data. Additionally, we explore more advanced machine learning and deep learning algorithms, such as gradient boosting machines and neural networks, to improve default classification. The results show strong performance comparable to existing algorithms scoring near the top of the open-source Kaggle leaderboard.

Introduction

The main project objective is to build a machine learning classification algorithm with current and existing loan application and payment history data to determine whether a new Home Credit loan applicant will default on a loan. This project abides by the Cross Industry Standard Process for Data Mining (CRISP-DM) framework to build a valuable default detection system. The CRISP-DM framework consists of six iterative phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. Since this modeling exercise is for research purposes only, the deployment step is omitted. The first phase of the project is to complete one iteration with the available data to model and evaluate a regularized logistic regression model for default detection. The following file contains the exploratory data analysis and initial baseline model for the project "Identifying Home Credit Application Default Risk". Modeling has been performed using data from the Home Credit Default Risk (HCDR) Kaggle competition.

Data

Data provided by Home Credit is available from the Kaggle website in CSV format. A brief description of the available data is mentioned below:

The main data available is loan application information at Home Credit. This data contains most information about the client – gender, income, family status, education, etc. Train and test versions of this data are available.

The train version of the file contains the “TARGET” field, which has a value of either “0” or “1”. “0” indicates that a loan has been repaid, whereas “1” indicates that a loan has not been repaid. The test version of the file does not contain the “TARGET” field. 307.5 thousand records with 122 columns are present in the train version of the data for model building, tuning, and validation, whereas 48.7 thousand records with 121 columns are present in the test version of the data for model evaluation. The six other sources of data supplement the loan application data to achieve higher accuracy.

The bureau data contains information about each client's previous credits reported by other financial institutions. Each prior loan has its own row in this file. 1.7 million records with 17 columns are present in this file. The bureau balance data contains monthly data about previous credits in each bureau. A single credit can be present in multiple rows – one for each month of the credit length. 27.3 million records with 3 columns are present in this file.

The previous application file contains information about clients' previous loans in Home Credit. Various loan parameters from the past, along with client information at the time of the previous application, are available. Each prior application has one row in this file, uniquely identified by a primary key – SK_ID_PREV. 1.7 million records along with 37 columns are present in this file. The point of sale cash balance, installment payments, and credit card balance for this data are also available in subsequent files.

The point-of-sale cash balance data contains monthly balance information maintained by clients in their previous loan in Home Credit. Each row contains one month of a credit balance, and a single credit can be present in multiple rows. 10 million records along with 8 columns are present in this data.

The installments payment data contains information about installment payments made by the clients in their previous loan in Home Credit. One row for every payment made and one row for every payment missed is present in this file. 13.6 million records along with 8 columns are present in this data.

The Credit Card Balance data contains information about monthly balances maintained by the clients in their previous credit card loans in Home Credit. 3.8 million records along with 23 columns are present in this data.
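In phase 2 these many-rows-per-client files will be merged onto the application data; the usual pattern is to aggregate them on SK_ID_CURR before joining. A sketch using the bureau file (the aggregation choices are illustrative):

```python
import pandas as pd

def merge_bureau_aggregates(app, bureau):
    """Aggregate bureau rows per client, then left-join onto application data."""
    agg = (bureau.groupby("SK_ID_CURR")
                 .agg(bureau_loan_count=("SK_ID_BUREAU", "count"),
                      bureau_overdue_max=("AMT_CREDIT_SUM_OVERDUE", "max"))
                 .reset_index())
    return app.merge(agg, on="SK_ID_CURR", how="left")
```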

EDA Summary

The target value distribution has been explored on the training dataset. There is an imbalance between the number of records indicating that the loan was repaid and those indicating that the loan was not repaid.

The correlation of the fields against the target column has been explored and provided below:

The percentage of missing values has been checked on the training dataset. There are no columns with more than 70% missing data. However, there are 17 columns which have over 60% missing data.

Model building and evaluation

The initial baseline model is fit only with the application data set to predict credit default.

Data is split such that 80% is in the training data and 20% is held out as testing data.
5-fold cross validation will be used for tuning and estimating accuracy before evaluating on the test data - see the tuning section below.

A standard machine with 4 cores and 16 GB of RAM does not have enough compute power to train a model of this size. Thus we make an additional split of the training data in order to perform hyperparameter tuning.

The hyperparameter tuning will be completed using a random 10% of the training data. The hyperparameters will be chosen based on the subset generating the highest 5-fold validation area under the ROC curve.

We will use the remaining 90% of the training data to perform 5-fold cross validation with the best hyperparameters from the tuning process to gather estimates of how the model will perform on unseen data.

For phase one, there are three types of algorithms explored, further detailed in the models section: elastic net logistic regression, random forest, and XGBoost.

There are four experiments tested with different subsets of features and algorithms:

The logistic regression model is tuned to determine the optimal mixture of L1 and L2 regularization and the amount of regularization.

The random forest algorithm is tuned to determine the optimal number of trees, maximum number of features, maximum tree depth, minimum split size, and minimum samples in a leaf.

Lastly the XGBoost model is tuned to determine the number of trees, learning rate, maximum tree depth, and columns sampled by each tree.

The main metric to optimize is the area under the ROC curve. The best performing model from the 5-fold cross validation will be fit on the entire training data and used to generate the first submission to the Kaggle competition.

The additional metrics used to evaluate performance on the test data set are precision, recall, f1 score, classification accuracy, and the confusion matrix which compares the class predictions to the actual default status.

Feature Engineering and transformers

Features

The first models fit leverage only the application data features to understand a baseline accuracy without feature engineering applied. There are 13 different feature types within the data set ranging from loan information to client demographics. The different types of features are shown below along with the number of features in each category from the raw data.

New features

For the second set of models, additional loan specific features are added including:

The results of the baseline model will be compared to the results of the model with the additional features added. Whichever feature set performs better will then be fit using more advanced tree based algorithms - random forest and xgboost.

The baseline model results will be compared to the results from the three additional models using area under the roc curve, classification accuracy, precision, and recall.

Additional features from previous applications, previous bureau data, and prior payment history data sets will be added in phase 2 to build on the classification performance from phase 1.

We will assess feature importance after model fitting to identify the top features to select for modeling in phase 2 as additional data sources will be added to the algorithms tested in phase 1.

As the feature space grows rapidly from adding features from the other subsets, feature selection steps will be applied. Features with high missingness, collinear features, and features with little importance will be removed within pipeline steps to reduce the feature space and optimize the data for predicting default.

Pipelines

For the initial model building phase, only the features from the application data set will be inputs into the model.

The data pipeline for these input features consists of four steps for the initial model building phase. There are two transformations applied to numeric features and two transformations applied to categorical features. The steps of the data preparation pipeline are listed below:

Pipeline architecture

Models

Regularized Logistic Regression w/ all columns

Regularized logistic regression (LR) is a generalized linear model with a sigmoid activation function applied to generate probabilities that input data is of a certain class. Logistic regression is a parameterized model optimizing binary cross-entropy loss when learning model coefficients.

Due to a large feature space, we will apply elastic net regularization which combines the L1 and L2 penalties to reduce model complexity and perform built in feature selection.

Logistic Regression Model Architecture

Learning a logistic regression model

The objective function for learning a binary logistic regression model (log loss) is known as the binary cross-entropy (CXE) loss function.

This can be expressed directly in terms of perpendicular distances from X to the hyperplane theta as follows:

Regularized Logistic Regression w/ loan features

For the second set of models, additional loan specific features are added including:

The results of the baseline model will be compared to the results of the model with the additional features added. Whichever feature set performs better will then be fit using more advanced tree based algorithms - random forest and xgboost.

We will assess feature importance after model fitting to identify the top features to select for modeling in phase 2 as additional data sources will be added to the algorithms tested in phase 1.

The baseline model results will be compared to the results from the three additional models using area under the roc curve, classification accuracy, precision, and recall.

Random Forest w/ loan features

Random forests (RF) are a bagging ensemble modeling technique which trains a collection of decision trees and averages the prediction made by each tree. For each tree only a random subsample of the available features is selected for building the tree.

Random Forest Model Architecture

Learning a random forest classification model

Decision tree learning is a supervised machine learning approach whose goal is to recursively partition the input feature space into different zones to predict the target value. Decision trees use information gain to determine the next optimal split within the tree. Information gain is the expected reduction in entropy from splitting on a particular attribute.

XGBoost w/ loan features

Extreme gradient boosting is a gradient boosting machine learning model optimized for faster convergence. A gradient boosting machine is a tree-based model which sequentially learns new trees to fit residuals from previous iterations. Thus, observations poorly predicted are sequentially weighted higher for future fits. This modeling technique consistently scores near the top of machine learning competitions.

Gradient Boosting Model Architecture

Learning a gradient boosting classification model

Decision tree learning is a supervised machine learning approach whose goal is to recursively partition the input feature space into different zones to predict the target value. Decision trees use information gain to determine the next optimal split within the tree. Information gain is the expected reduction in entropy from splitting on a particular attribute.

Gradient boosting then sequentially learns decision trees to model residuals from previous decision trees.

Experimental Results & Discussion

The best performing model was the XGBoost model (200 trees, 0.2 column sample by tree, 0.1 learning rate, 3 max tree depth) using all the application data features with additional loan domain specific features such as credit to income ratio.

There were a significant number of features which had no importance in the model. As expected from the EDA the most important features were the external source evaluations.
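The zero-importance features mentioned above can be identified directly from the fitted model; a sketch assuming a fitted tree-based model and the post-pipeline feature names:

```python
import pandas as pd

def importance_table(model, feature_names):
    """Rank features by importance; zero-importance features are
    candidates for removal in the phase 2 feature selection step."""
    imp = pd.Series(model.feature_importances_, index=feature_names)
    return imp.sort_values(ascending=False)

# unused = importance_table(xgb, names)
# unused = unused[unused == 0].index
```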

Conclusion

This project aims to implement an accurate repayment detection method for Home Credit, since an inaccurate method would have major consequences. For phase 1, we focused only on the application dataset. We chose logistic regression with regularization on just the application data as our base model. We then used the knowledge gained from EDA on the application dataset to create 6 new features. Using these 6 new features, we then fit logistic regression, random forest, and XGBoost models. Our current best model uses XGBoost, where we noticed a 1.7 percentage point increase in accuracy compared to the base model. We submitted to the Home Credit Default Risk Kaggle competition and obtained an area under the ROC curve of 0.7585; our current rank is 4911 on the Kaggle leaderboard. As part of phase 2, we will use the data from the remaining datasets by merging them onto the application dataset. We will further create new features on the combined dataset, which is expected to further improve accuracy.

References

Some of the material in this notebook has been adapted from Start Here: A Gentle Introduction

The architecture images are sourced from